
    Faster Robust Tensor Power Method for Arbitrary Order

    Tensor decomposition is a fundamental method used in various areas to deal with high-dimensional data. The \emph{tensor power method} (TPM) is one of the most widely used techniques for decomposing tensors. This paper presents a novel tensor power method for decomposing arbitrary-order tensors, which overcomes limitations of existing approaches that are often restricted to lower-order (less than 3) tensors or require strong assumptions about the underlying data structure. We apply a sketching method and achieve a running time of $\widetilde{O}(n^{p-1})$ for a tensor of order $p$ and dimension $n$. We provide a detailed analysis for any $p$-th order tensor, which has not been given in previous works.
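
    To make the classical power update behind TPM concrete, here is a minimal numpy sketch of one-component power iteration on a symmetric third-order tensor. It omits deflation and the sketching acceleration that yields the $\widetilde{O}(n^{p-1})$ running time; the iteration count and initialization are illustrative assumptions, not the paper's algorithm.

        import numpy as np

        def tensor_power_method(T, num_iters=100, seed=0):
            """Plain (non-sketched) rank-1 power iteration for a symmetric
            third-order tensor T: contract T with the current vector along
            two modes and renormalize."""
            n = T.shape[0]
            rng = np.random.default_rng(seed)
            u = rng.normal(size=n)
            u /= np.linalg.norm(u)
            for _ in range(num_iters):
                v = np.einsum('ijk,j,k->i', T, u, u)    # contraction T(I, u, u)
                u = v / np.linalg.norm(v)
            lam = np.einsum('ijk,i,j,k->', T, u, u, u)  # estimated eigenvalue
            return lam, u

        # toy usage: the single component of a rank-1 symmetric tensor is recovered
        n = 8
        a = np.random.randn(n); a /= np.linalg.norm(a)
        T = 2.0 * np.einsum('i,j,k->ijk', a, a, a)
        lam, u = tensor_power_method(T)
        print(round(float(lam), 3), round(float(abs(a @ u)), 3))  # ~2.0 and ~1.0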

    Attention Scheme Inspired Softmax Regression

    Large language models (LLMs) have brought transformative changes to human society. One of the key computations in LLMs is the softmax unit. This operation is important in LLMs because it allows the model to generate a distribution over possible next words or phrases, given a sequence of input words. This distribution is then used to select the most likely next word or phrase, based on the probabilities assigned by the model. The softmax unit plays a crucial role in training LLMs, as it allows the model to learn from the data by adjusting the weights and biases of the neural network. In the area of convex optimization, such as using the central path method to solve linear programming, the softmax function has been used as a crucial tool for controlling the progress and stability of the potential function [Cohen, Lee and Song STOC 2019; Brand SODA 2020]. In this work, inspired by the softmax unit, we define a softmax regression problem. Formally speaking, given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, the goal is to use a greedy-type algorithm to solve \begin{align*} \min_{x} \| \langle \exp(Ax), {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2^2. \end{align*} In a certain sense, our provable convergence result provides theoretical support for why a greedy algorithm can be used to train the softmax function in practice.
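
    As a rough numerical illustration of the objective above, the numpy sketch below minimizes $\| \langle \exp(Ax), {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2^2$ with plain gradient descent; the paper analyzes a greedy-type method, and the step size and iteration count here are illustrative assumptions.

        import numpy as np

        def softmax(z):
            z = z - z.max()                    # numerical stability
            u = np.exp(z)
            return u / u.sum()

        def softmax_regression_gd(A, b, step=0.5, iters=2000):
            """Gradient descent on f(x) = || <exp(Ax), 1_n>^{-1} exp(Ax) - b ||_2^2."""
            x = np.zeros(A.shape[1])
            for _ in range(iters):
                p = softmax(A @ x)
                c = p - b
                # Jacobian of softmax(z) is diag(p) - p p^T; chain rule through z = Ax
                grad = 2.0 * A.T @ (p * c - p * (p @ c))
                x -= step * grad
            return x

        # toy usage: the target b is itself a softmax, so near-zero loss is attainable
        n, d = 20, 5
        A = np.random.randn(n, d)
        b = softmax(A @ np.random.randn(d))
        x_hat = softmax_regression_gd(A, b)
        print(np.linalg.norm(softmax(A @ x_hat) - b))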

    Convergence of Two-Layer Regression with Nonlinear Units

    Large language models (LLMs), such as ChatGPT and GPT-4, have shown outstanding performance in many tasks of human life. Attention computation plays an important role in training LLMs. The softmax unit and the ReLU unit are the key structures in attention computation. Inspired by them, we put forward a softmax ReLU regression problem. Generally speaking, our goal is to find an optimal solution to the regression problem involving the ReLU unit. In this work, we calculate a closed-form representation for the Hessian of the loss function. Under certain assumptions, we prove the Lipschitz continuity and the positive semidefiniteness of the Hessian. Then, we introduce a greedy algorithm based on an approximate Newton method, which converges in the sense of the distance to the optimal solution. Last, we relax the Lipschitz condition and prove convergence in the sense of the loss value.
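
    The abstract does not spell out the loss, so the sketch below uses an assumed stand-in, $\|\mathrm{softmax}(\mathrm{ReLU}(Ax)) - b\|_2^2$, and runs a Newton-style iteration with a finite-difference Hessian clipped to be PSD, only to mirror the approximate-Newton idea; the composition order, clipping threshold, and iteration count are assumptions, not the paper's algorithm.

        import numpy as np

        def softmax(z):
            u = np.exp(z - z.max())
            return u / u.sum()

        def loss(x, A, b):
            # assumed stand-in loss combining the ReLU and softmax units
            return np.sum((softmax(np.maximum(A @ x, 0.0)) - b) ** 2)

        def grad_hess(f, x, eps=1e-5):
            """Finite-difference gradient and Hessian (adequate for a small-d sketch)."""
            d = x.size
            g, H = np.zeros(d), np.zeros((d, d))
            E = eps * np.eye(d)
            for i in range(d):
                g[i] = (f(x + E[i]) - f(x - E[i])) / (2 * eps)
                for j in range(d):
                    H[i, j] = (f(x + E[i] + E[j]) - f(x + E[i] - E[j])
                               - f(x - E[i] + E[j]) + f(x - E[i] - E[j])) / (4 * eps ** 2)
            return g, H

        def approx_newton(A, b, iters=25):
            """Newton-style steps with the Hessian clipped to be PSD,
            echoing the PSDness the paper proves for its exact Hessian."""
            x = np.zeros(A.shape[1])
            f = lambda v: loss(v, A, b)
            for _ in range(iters):
                g, H = grad_hess(f, x)
                w, Q = np.linalg.eigh((H + H.T) / 2)
                w = np.maximum(w, 1e-3)        # PSD-ify so the step is a descent direction
                x -= Q @ ((Q.T @ g) / w)
            return x

        # toy usage with a realizable target
        n, d = 30, 4
        A = np.random.randn(n, d)
        b = softmax(np.maximum(A @ np.random.randn(d), 0.0))
        print(loss(approx_newton(A, b), A, b))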

    Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension

    Large language models (LLMs) have shown their power in different areas. Attention computation, as an important subroutine of LLMs, has also attracted interest in theory. Recently, the static computation and dynamic maintenance of the attention matrix have been studied by [Alman and Song 2023] and [Brand, Song and Zhou 2023] from both the algorithmic perspective and the hardness perspective. In this work, we consider the sparsification of the attention problem. We make one simplification, which is that the logit matrix is symmetric. Let $n$ denote the length of the sentence and let $d$ denote the embedding dimension. Given a matrix $X \in \mathbb{R}^{n \times d}$, suppose $d \gg n$ and $\| X X^\top \|_{\infty} < r$ with $r \in (0,0.1)$; then we aim to find $Y \in \mathbb{R}^{n \times m}$ (where $m \ll d$) such that \begin{align*} \| D(Y)^{-1} \exp( Y Y^\top ) - D(X)^{-1} \exp( X X^\top) \|_{\infty} \leq O(r). \end{align*} We provide two results for this problem. $\bullet$ Our first result is a randomized algorithm. It runs in $\widetilde{O}(\mathrm{nnz}(X) + n^{\omega})$ time, has $1-\delta$ success probability, and chooses $m = O(n \log(n/\delta))$. Here $\mathrm{nnz}(X)$ denotes the number of non-zero entries in $X$. We use $\omega$ to denote the exponent of matrix multiplication; currently $\omega \approx 2.373$. $\bullet$ Our second result is a deterministic algorithm. It runs in $\widetilde{O}(\min\{\sum_{i\in[d]}\mathrm{nnz}(X_i)^2, dn^{\omega-1}\} + n^{\omega+1})$ time and chooses $m = O(n)$. Here $X_i$ denotes the $i$-th column of the matrix $X$. Our main findings have the following implication for applied LLM tasks: for any super-large feature dimension, we can reduce it down to a size nearly linear in the length of the sentence.
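
    A rough numpy illustration of the randomized route: replace the $d$ features of $X$ with $m = O(n \log(n/\delta))$ sketched features $Y = XS$ so that $YY^\top \approx XX^\top$ entrywise and the two normalized attention matrices stay close. The Gaussian sketch below is an illustrative stand-in, not the paper's construction, and the constants are assumptions.

        import numpy as np

        def attention(Z):
            """D(Z)^{-1} exp(Z Z^T), where D(Z) = diag(exp(Z Z^T) 1_n)."""
            E = np.exp(Z @ Z.T)
            return E / E.sum(axis=1, keepdims=True)

        def sparsify_features(X, delta=0.1, seed=0):
            """Feature reduction Y = X S with m = ceil(n log(n/delta)) columns.
            A Gaussian Johnson-Lindenstrauss sketch stands in for the paper's
            nnz(X) + n^omega time construction."""
            n, d = X.shape
            m = int(np.ceil(n * np.log(n / delta)))
            rng = np.random.default_rng(seed)
            S = rng.normal(size=(d, m)) / np.sqrt(m)    # E[S S^T] = I_d
            return X @ S

        # toy usage: d >> n and || X X^T ||_inf < r, as in the theorem statement
        n, d, r = 10, 5000, 0.1
        X = np.random.randn(n, d)
        X *= np.sqrt(0.5 * r / np.abs(X @ X.T).max())   # enforce the small-logit regime
        Y = sparsify_features(X)
        print(Y.shape[1], np.abs(attention(Y) - attention(X)).max())  # m and O(r)-ish error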

    Superiority of Softmax: Unveiling the Performance Edge Over Linear Attention

    Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token interactions within sequences through the utilization of the softmax function. Conversely, linear attention presents a more computationally efficient alternative by approximating the softmax operation with linear complexity. However, it exhibits substantial performance degradation when compared to the traditional softmax attention mechanism. In this paper, we bridge the gap in our theoretical understanding of the reasons behind the practical performance gap between softmax and linear attention. By conducting a comprehensive comparative analysis of these two attention mechanisms, we shed light on the underlying reasons why softmax attention outperforms linear attention in most scenarios.
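
    For concreteness, the two mechanisms under comparison can be written side by side in numpy. The feature map used for linear attention below (ReLU plus a small constant) is just one common illustrative choice, not something fixed by the paper.

        import numpy as np

        def softmax_attention(Q, K, V):
            """Standard attention: row-wise softmax of Q K^T / sqrt(d), applied to V."""
            d = Q.shape[1]
            S = Q @ K.T / np.sqrt(d)
            S = np.exp(S - S.max(axis=1, keepdims=True))
            return (S / S.sum(axis=1, keepdims=True)) @ V

        def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
            """Linear attention: replace the softmax with a feature map phi, so the
            summaries phi(K)^T V and phi(K)^T 1 are computed once -- O(n d^2)
            instead of O(n^2 d)."""
            Qf, Kf = phi(Q), phi(K)
            KV = Kf.T @ V                       # d x d_v summary
            Z = Kf.sum(axis=0)                  # d-dimensional normalizer
            return (Qf @ KV) / (Qf @ Z)[:, None]

        # toy usage: both produce an n x d output, but with different complexity
        n, d = 6, 4
        Q, K, V = (np.random.randn(n, d) for _ in range(3))
        print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)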

    Clustered Linear Contextual Bandits with Knapsacks

    In this work, we study clustered contextual bandits where rewards and resource consumption are the outcomes of cluster-specific linear models. The arms are divided into clusters, with the cluster memberships being unknown to the algorithm. Pulling an arm in a time period results in a reward and in consumption of each of multiple resources, and the total consumption of any resource exceeding its constraint implies the termination of the algorithm. Thus, maximizing the total reward requires learning not only models of the reward and the resource consumption, but also the cluster memberships. We provide an algorithm that achieves regret sublinear in the number of time periods, without requiring access to all of the arms. In particular, we show that it suffices to perform clustering only once, on a randomly selected subset of the arms. To achieve this result, we provide a sophisticated combination of techniques from the literature on econometrics and on bandits with constraints.
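
    A toy numpy sketch of the interaction protocol only: cluster-specific linear models generate rewards and resource consumption, and the process terminates once any budget is exhausted. The uniform arm-selection rule is a placeholder, not the paper's algorithm, and all numerical values are assumptions.

        import numpy as np

        rng = np.random.default_rng(0)

        # toy environment (sizes assumed): K arms in C clusters, cluster-specific
        # linear models for the reward and for the consumption of R resources
        K, C, d, R, T = 20, 3, 5, 2, 10_000
        budgets = np.full(R, 50.0)
        cluster_of = rng.integers(C, size=K)        # unknown to the learner
        theta_reward = rng.random((C, d))           # per-cluster reward model
        theta_cost = rng.random((C, R, d))          # per-cluster consumption models

        total_reward = 0.0
        for t in range(T):
            context = rng.random(d)
            arm = rng.integers(K)                   # placeholder policy; the paper's
                                                    # algorithm clusters a random arm
                                                    # subset once, then runs a
                                                    # constrained linear bandit
            c = cluster_of[arm]
            total_reward += theta_reward[c] @ context
            budgets -= theta_cost[c] @ context
            if (budgets <= 0).any():                # knapsack constraint: stop as soon
                break                               # as any resource is exhausted
        print(t, round(total_reward, 2))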

    Solving Tensor Low Cycle Rank Approximation

    Large language models have become ubiquitous in modern life, finding applications in various domains such as natural language processing, language translation, and speech recognition. Recently, a breakthrough work [Zhao, Panigrahi, Ge, and Arora Arxiv 2023] explains the attention model from the perspective of probabilistic context-free grammars (PCFG). One of the central computational tasks for computing probabilities in a PCFG can be formulated as a particular tensor low-rank approximation problem, which we call tensor cycle rank. Given an $n \times n \times n$ third-order tensor $A$, we say that $A$ has cycle rank $k$ if there exist three $n \times k^2$ matrices $U$, $V$, and $W$ such that \begin{align*} A_{a,b,c} = \sum_{i=1}^k \sum_{j=1}^k \sum_{l=1}^k U_{a,i+k(j-1)} \otimes V_{b, j + k(l-1)} \otimes W_{c, l + k(i-1) } \end{align*} for all $a \in [n], b \in [n], c \in [n]$. The classical tensor rank, Tucker rank, and train rank have been well studied in [Song, Woodruff, Zhong SODA 2019]. In this paper, we generalize the previous ``rotation and sketch'' technique on page 186 of [Song, Woodruff, Zhong SODA 2019] and show an input-sparsity-time algorithm for cycle rank.
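
    The defining identity above is easy to instantiate directly; the brute-force numpy sketch below assembles a tensor of cycle rank at most $k$ from random $U$, $V$, $W$ (a check of the definition, not the paper's input-sparsity-time algorithm).

        import numpy as np

        def cycle_rank_tensor(U, V, W, k):
            """Assemble A_{a,b,c} = sum_{i,j,l} U[a, i+k(j-1)] V[b, j+k(l-1)] W[c, l+k(i-1)]
            (1-indexed in the abstract; 0-indexed below)."""
            n = U.shape[0]
            A = np.zeros((n, n, n))
            for i in range(k):
                for j in range(k):
                    for l in range(k):
                        u = U[:, i + k * j]        # column i + k(j-1) in 1-indexed terms
                        v = V[:, j + k * l]
                        w = W[:, l + k * i]
                        A += np.einsum('a,b,c->abc', u, v, w)
            return A

        # toy usage: a random tensor of cycle rank (at most) k
        n, k = 6, 2
        U, V, W = (np.random.randn(n, k * k) for _ in range(3))
        A = cycle_rank_tensor(U, V, W, k)
        print(A.shape)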

    Unmasking Transformers: A Theoretical Approach to Data Recovery via Attention Weights

    In the realm of deep learning, transformers have emerged as a dominant architecture, particularly in natural language processing tasks. However, with their widespread adoption, concerns regarding the security and privacy of the data processed by these models have arisen. In this paper, we address a pivotal question: can the data fed into transformers be recovered using their attention weights and outputs? We introduce a theoretical framework to tackle this problem. Specifically, we present an algorithm that aims to recover the input data $X \in \mathbb{R}^{d \times n}$ from given attention weights $W = QK^\top \in \mathbb{R}^{d \times d}$ and output $B \in \mathbb{R}^{n \times n}$ by minimizing the loss function $L(X)$. This loss function captures the discrepancy between the expected output and the actual output of the transformer. Our findings have significant implications for the Localized Layer-wise Mechanism (LLM), suggesting potential vulnerabilities in the model's design from a security and privacy perspective. This work underscores the importance of understanding and safeguarding the internal workings of transformers to ensure the confidentiality of processed data.
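
    The precise $L(X)$ is defined in the paper; as a hedged stand-in consistent with the stated dimensions, the numpy sketch below takes $L(X) = \| D^{-1} \exp(X^\top W X) - B \|_F^2$ with $D$ the row-sum normalizer and minimizes it by gradient descent using a finite-difference gradient. The loss form, step size, and initialization are assumptions for illustration only.

        import numpy as np

        def attention_output(X, W):
            """Row-normalized exp(X^T W X); X is d x n, W is d x d, output is n x n."""
            E = np.exp(X.T @ W @ X)
            return E / E.sum(axis=1, keepdims=True)

        def recovery_loss(X, W, B):
            # assumed stand-in for the paper's L(X): squared discrepancy to the observed B
            return np.sum((attention_output(X, W) - B) ** 2)

        def recover_inputs(W, B, d, n, step=0.5, iters=300, eps=1e-5, seed=0):
            """Gradient descent on L(X) with a finite-difference gradient
            (a slow but dependency-free stand-in for autodiff)."""
            X = np.random.default_rng(seed).normal(size=(d, n)) * 0.1
            for _ in range(iters):
                G = np.zeros_like(X)
                for idx in np.ndindex(d, n):
                    E = np.zeros_like(X)
                    E[idx] = eps
                    G[idx] = (recovery_loss(X + E, W, B)
                              - recovery_loss(X - E, W, B)) / (2 * eps)
                X -= step * G
            return X

        # toy usage: build W, B from a hidden X_true, then try to reconstruct an input
        d, n = 3, 4
        X_true = np.random.randn(d, n) * 0.5
        W = np.random.randn(d, d)
        B = attention_output(X_true, W)
        X_hat = recover_inputs(W, B, d, n)
        print(recovery_loss(X_hat, W, B))       # small residual = a consistent reconstruction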